ABSTRACT


A score used for selection or classification should predict the
performance of different population subgroups equally well. This
research memorandum analyzes the prediction of hands-on performance in
the Automotive Mechanic specialty, using the Marine Corps' Mechanical
Maintenance (MM) composite.


EXECUTIVE  SUMMARY 


The  Armed  Services  Vocational  Aptitude  Battery  (ASVAB)  is  used  to 
select  and  classify  enlisted  personnel.  The  Armed  Forces  Qualification 
Test  is  used  to  select  personnel,  and  the  service  composites  are  used  to 
classify  them  into  occupational  specialties.  The  Marine  Corps  uses  the 
Mechanical  Maintenance  (MM)  composite  for  classifying  personnel  into 
occupations  involving  maintenance  and  repair  of  mechanical  systems. 

In a recent report, the General Accounting Office (GAO) raised some
questions about the fairness of composite scores used by the services
for technical occupations. GAO concluded that composites are less
successful in predicting the performance of women and minorities than
they are in predicting that of white males, especially if performance is
measured in the field. DOD is preparing a response to the GAO report,
based on analyses of data from a large number of occupations from the
four services.

In the MM phase of its Job Performance Measurement (JPM) project, the
Marine Corps has developed hands-on performance tests (HOPTs) for the
Automotive Mechanic specialty (MOS 3521) and four helicopter repair
specialties (MOSs 6112 to 6115). The content of each test was derived
from an extensive job analysis based on the Individual Training
Standards. Each test was scored by former Marines who had experience in
the occupation and who had been trained to score performance as
objectively as possible. A report of the National Academy of Sciences
calls a test score obtained in this manner "the benchmark measure" of
job performance.

This study analyzes only the Automotive Mechanic data because sample
sizes in the other specialties were too small for useful analysis. Even
this occupation had few women and Hispanics, and therefore only blacks
and whites were compared. After removing cases with incomplete data, the
sample contained 118 blacks and 632 whites.

Fairness of the MM composite means that a specific MM score predicts
the same HOPT score for all individuals, regardless of their group
membership. This similarity of predicted scores is tested via regression
analysis in which the slopes of the prediction equations for the two
groups are compared, and so are the intercepts. The hypothesis of
fairness was tested separately for two MM scores: one used for
enlistment in the Marine Corps, and the other from an ASVAB administered
concurrently with the HOPT as part of the JPM project. For both sources
of aptitude information, differences between blacks and whites in the
slopes and intercepts of the regression lines were found to be
statistically nonsignificant.

In summary, the evidence indicates that the MM composite score is
equally sensitive for both subgroups as a predictor of hands-on
performance on the job. In addition, it does not underpredict or
overpredict the performance of either subgroup.

INTRODUCTION


The Armed Services Vocational Aptitude Battery (ASVAB) is used to
select and classify enlisted personnel. It contains ten subtests:
General Science (GS), Arithmetic Reasoning (AR), Word Knowledge (WK),
Paragraph Comprehension (PC), Numerical Operations (NO), Coding Speed
(CS), Auto and Shop Information (AS), Mathematics Knowledge (MK),
Mechanical Comprehension (MC), and Electronics Information (EI). The
Verbal (VE) raw score is defined as the sum of WK and PC scores.
Subtests NO and CS are tests of speed in handling numerical and symbolic
material. All others are power tests with liberal time limits. Standard
scores rather than raw scores on the subtests are used in all decisions
based on the ASVAB. Standard scores are integers from 20 to 80, with a
mean of 50 and a standard deviation of 10 in the 1980 reference
population.

Standard scores from certain subtests are combined to compute an
individual's Armed Forces Qualification Test (AFQT) score, which is the
primary score used to select individuals for military service. Composite
scores are used within each service to classify a recruit into a
military occupational specialty (MOS). The Marine Corps uses four
composites: Mechanical Maintenance (MM), which contains AR, AS, MC, and
EI; Clerical (CL), which contains VE, MK, and CS; Electronics (EL),
which contains GS, AR, MK, and EI; and General Technical (GT), which
contains VE, AR, and MC. Scores on these composites have a mean of 100
and a standard deviation of 20 in the reference population.

The General Accounting Office (GAO) has raised some questions about
the fairness of service composites used for technical specialties [1].
According  to  the  Executive  Summary  of  GAO's  report, 

     GAO concluded that, for most recruits, the services'
     selection criteria are moderately successful at predicting
     individual performance during classroom technical training.
     However, they are notably less successful for women and
     minority recruits.... Only the Army systematically collects
     data on the field performance of individual graduates in a
     way that would allow comparison of a graduate's on-the-job
     performance with his or her entry level ability and
     classroom performance. These data reveal an even weaker
     connection for women and minority group members between
     criteria used to assign them to technical specialties and
     their later field performance [1, p. 3].

Fairness means that a score used for selection or classification
predicts the same performance level for all individuals with the same
score, regardless of their group membership. This similarity of
predicted scores is tested via regression analysis, in which the slopes
of the prediction equations for the two groups are compared and then, if
the difference is nonsignificant, the intercepts are compared.
Comparability of slopes from the separate regressions for each group
implies equal sensitivity of the predictor. Equality of intercepts
indicates that the test does not underpredict or overpredict the
performance of any group. These hypotheses were evaluated by using the
MM score to predict scores on a hands-on performance test (HOPT) that
measured proficiency on representative job tasks. The MM composite is
used for occupations involving mechanical repair and maintenance.
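
The two hypotheses can be written compactly as a single regression with
a group indicator. The notation below is an illustrative assumption, not
taken from the memorandum: Y is the HOPT score, X the aptitude score,
and G equals 1 for one group and 0 for the other.

```latex
\begin{align}
Y &= \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (G \cdot X) + \varepsilon \\
H_0^{\text{slope}} &: \beta_3 = 0 \\
H_0^{\text{intercept}} &: \beta_2 = 0
  \quad \text{(tested after setting } \beta_3 = 0 \text{)}
\end{align}
```

Under this sketch, a nonzero beta_3 would mean the composite is more
sensitive for one group than the other, and a nonzero beta_2 with equal
slopes would mean a constant under- or overprediction for one group.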

Figure 1 shows the regression lines for a test that is fair to groups A
and B. The comparison of the slopes determines whether the regression
lines are parallel. The second significance test determines whether the
regression intercepts are significantly different. A fair test is one in
which the slopes and intercepts for the two groups do not differ and
hence the lines overlap. Therefore, all aptitude scores result in equal
predicted scores for the two groups [Y(A) = Y(B)].


[Figure 1. Horizontal axis: aptitude test score. Label on the axis:
X(A) = X(B).]

Figure 1. Regression lines for a fair test: equal slopes and intercepts


DATA 


In the Mechanical Maintenance phase of its Job Performance Measurement
(JPM) project, the Marine Corps developed HOPTs for five occupations for
which MM is used as the classification composite. These are the
Automotive Mechanic specialty (MOS 3521) and four helicopter specialties
(CH-46, MOS 6112; CH-53A/D, MOS 6113; UH/AH, MOS 6114; and CH-53E, MOS
6115). Each test consists of a sample of tasks that a mechanic in that
specialty needs to perform in the course of his or her work.
Requirements of each job were determined using the Individual Training
Standards of the Marine Corps. Each task was divided into a number of
steps, each of which was scored as performed correctly or not. The test
was administered by former Marines with relevant job experience. The
administrators were trained to score performance as objectively as
possible [2]. A score resulting from such a process has been referred to
as the "benchmark measure" of job performance [3, p. 95].

A  data  set  was  constructed  for  each  MOS  containing  the  cases  for 
which  a  valid  HOPT  score  was  available.  The  largest  total  sample  size 
among  helicopter  MOSs  was  215  for  MOS  6114;  the  largest  minority  sample 
size  was  22;  and  no  women  were  in  the  MOSs.  Therefore,  analyses  were 
performed  only  for  the  Automotive  Mechanic  specialty. 

Time  in  service  (TIS)  exceeded  10  years  in  only  four  cases,  with 
values  ranging  from  136  to  160  months;  these  cases  were  excluded  as 
outliers  (i.e.,  cases  that  are  unusually  far  from  most  Marines).  The 
available  ASVAB  scores  are  those  with  which  the  Marine  enlisted,  and 
scores  from  a  computerized  adaptive  testing  (CAT)  version  of  the  ASVAB 
that  was  administered  concurrently  with  the  HOPT.  All  cases  with  missing 
enlistment  or  CAT  scores  were  deleted.  The  remaining  sample  contained 
only  44  women  and  83  Hispanics  (the  latter  number  being  distinctly 
smaller  than  the  sample  size  for  blacks).  These  groups  were  excluded 
from  the  study  because  the  sample  contained  too  few  of  them  for  useful 
analysis. The final sample, with complete data for each Marine,
contained 118 blacks and 632 whites.

ANALYSES  AND  RESULTS 

TIS  is  a  powerful  predictor  of  hands-on  performance.  That  is, 
given  equal  ASVAB  scores,  senior  Marines  score  higher  on  the  average 
than  junior  ones  due  to  on-the-job  training.  The  rate  of  growth  slows 
as  time  increases.  Therefore,  TIS  and  its  square  were  included  as 
predictors  along  with  MM  scores. 

In simple regression analyses, outliers are usually removed. In
multiple regressions, however, this simple approach can be inadequate.
Each case may need to be examined in terms of how it affects the
estimates of the regression weights. The effect is quantified as
follows: The weights are estimated using the entire sample. Then they
are recomputed with one observation removed. For each predictor, the
latter estimate is subtracted from the former, and the difference is
divided by the standard error of the estimate [4]. The ratio yields the
"influence" of the observation on the estimated coefficient of the
predictor. A large value of either sign shows that the observation
changes the estimate substantially, and thus behaves like an outlier in
simple regression.
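
The leave-one-out calculation described above can be sketched in a few
lines. This is a minimal illustration on synthetic data, not the
memorandum's own code: the function name `dfbetas`, the data, and the
choice to scale by the standard error computed with the case removed
(one common convention) are all assumptions. Only the .25 cutoff comes
from the text.

```python
import numpy as np

def dfbetas(X, y):
    """Scaled change in each regression weight when one case is removed.

    X is an (n, p) design matrix that already includes a column of ones;
    y is the (n,) criterion. Returns an (n, p) array of influence values.
    """
    n, p = X.shape
    # Square roots of the diagonal of (X'X)^-1 scale each weight's standard error.
    scale = np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = y[keep] - X[keep] @ beta_i
        # Residual standard deviation estimated without case i (assumed convention).
        s_i = np.sqrt(resid @ resid / (n - 1 - p))
        out[i] = (beta_full - beta_i) / (s_i * scale)
    return out

# Synthetic example (hypothetical numbers): one planted high-leverage case.
rng = np.random.default_rng(0)
n = 60
x = rng.normal(100.0, 20.0, n)
y = 10.0 + 0.3 * x + rng.normal(0.0, 5.0, n)
x[0], y[0] = 160.0, 18.0  # high aptitude score, very low criterion score
X = np.column_stack([np.ones(n), x])

influence = dfbetas(X, y)
# Flag cases whose influence on the slope exceeds .25 in magnitude.
flagged = np.flatnonzero(np.abs(influence[:, 1]) > 0.25)
```

The brute-force loop refits the model once per case, which is simple to
verify and fast enough for samples of a few hundred.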

As  the  minority  sample  size  was  only  118,  a  few  influential  cases 
could  affect  the  result  substantially.  Therefore,  each  significance 
test  was  preceded  by  influence  analysis.  Cases  with  extreme  values  of 
the  influence  function  were  excluded,  and  then  a  significance  test  was 
performed on the edited sample.

Specifically, consider the analysis of the MM score used for
enlistment. The regression equation initially included a term to
represent the difference between the slopes for blacks and those for
whites. Influence on this term was calculated for all individuals in the
sample. The standard deviation of the influence values was .038, and the
mean was zero, as expected. Four examinees with influence exceeding .25
in magnitude were deleted from the sample. Using the edited sample, the
F ratio for the difference between slopes was 0.54, which is
statistically nonsignificant. Therefore, in the analysis of the
difference between intercepts, the slopes in the two groups were set to
be equal. Influence analysis was then performed for the difference
between intercepts. The standard deviation of the influence values was
.041. Again, cases with influence above .25 in magnitude were deleted,
which further reduced the sample size by three. The F ratio for the
difference between intercepts was 3.62, which is not significant at the
.05 level.
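
The two F ratios in this procedure can be illustrated as comparisons of
nested regressions. The sketch below is an assumption about the model
layout, not the memorandum's code: it uses MM, TIS, and TIS squared as
common predictors, a 0/1 group indicator for the intercept difference,
and the product of group and MM for the slope difference. The function
names and the synthetic data are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def nested_f(X_reduced, X_full, y):
    """F ratio for the columns that X_full adds to X_reduced."""
    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = len(y) - X_full.shape[1]
    sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
    return ((sse_r - sse_f) / df_num) / (sse_f / df_den)

def fairness_f_ratios(mm, tis, group, hopt):
    """Return (F for slope difference, F for intercept difference)."""
    base = np.column_stack([np.ones(len(mm)), mm, tis, tis ** 2])
    with_group = np.column_stack([base, group])
    with_slope = np.column_stack([with_group, group * mm])
    # Slope test: does the group-by-MM term improve the fit?
    f_slope = nested_f(with_group, with_slope, hopt)
    # Intercept test: slopes set equal, does the group term improve the fit?
    f_intercept = nested_f(base, with_group, hopt)
    return f_slope, f_intercept

# Synthetic data with a planted intercept difference and no slope difference.
rng = np.random.default_rng(1)
n = 200
group = (np.arange(n) < 50).astype(float)
mm = rng.normal(100.0, 20.0, n)
tis = rng.uniform(6.0, 96.0, n)
hopt = (20.0 + 0.3 * mm + 0.4 * tis - 0.002 * tis ** 2
        - 10.0 * group + rng.normal(0.0, 5.0, n))

f_slope, f_intercept = fairness_f_ratios(mm, tis, group, hopt)
```

Each F ratio has one numerator degree of freedom here, so it equals the
square of the t statistic for the added term.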

A similar procedure was followed with the MM score from the concurrent
CAT-ASVAB. The cutoff value for size of influence was again .25. Three
cases were deleted for the analysis of slopes and two more for the
analysis of intercepts. Table 1 presents detailed results for enlistment
and CAT-ASVAB.


Table 1.  Analyses of regression slopes and intercepts
          using enlistment and CAT-ASVAB scores

                           Enlistment               CAT
                        Blacks    Whites     Blacks    Whites

Slope
  Sample sizes            114       632        115       632
  Estimates               .22       .31        .38       .35
  F ratio                     0.54                 0.17
  Significance level           .46                  .68

Intercept
  Sample sizes            111       632        114       631
  Estimates             37.67     39.15      32.70     34.09
  F ratio                     3.62                 3.58
  Significance level          .057                 .059

Most cases excluded due to extreme influence values were blacks.
Because the black sample size is less than a fifth of the white sample
size, a black individual tends to influence the difference between
subgroups more than a white individual.

DISCUSSION


The statistical significance of the intercept differences is even
weaker than it appears. Since four F tests have been performed, a .05
significance level for the entire set of tests requires that, for an
individual F ratio to be considered significant, its tail probability
should be smaller than .05/4 = .0125. If the .05 significance level is
applied to individual F tests, the overall significance level can be as
high as .05*4 = .20. Thus, the set of four F tests reported above is
nonsignificant at the .20 level.
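
The arithmetic of this adjustment can be checked directly against the
significance levels reported in Table 1. The snippet below is a small
illustration; the variable names are hypothetical.

```python
# Significance levels of the four F tests, as reported in Table 1.
p_values = {
    "slope, enlistment": 0.46,
    "slope, CAT": 0.68,
    "intercept, enlistment": 0.057,
    "intercept, CAT": 0.059,
}

alpha_family = 0.05
# Per-test threshold that keeps the whole set at the .05 level.
alpha_each = alpha_family / len(p_values)

# Tests that would count as significant under the per-test threshold.
significant = [name for name, p in p_values.items() if p < alpha_each]

# Upper bound on the overall level if each test is run at .05.
overall_bound = min(1.0, alpha_family * len(p_values))
```

With these values, `alpha_each` is .0125, no test falls below it, and
`overall_bound` is .20.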

In  summary,  Marine  Corps  JPM  results  for  the  Automotive  Mechanic 
specialty,  using  the  hands-on  performance  test  as  the  criterion,  show 
that  the  Mechanical  Maintenance  composite  is  equally  sensitive  for 
blacks  and  whites.  The  results  also  show  that  the  regression  equation 
does  not  overpredict  or  underpredict  the  performance  of  blacks. 




REFERENCES 


[1]  United States General Accounting Office, Military Training: Its
     Effectiveness for Technical Specialties is Unknown. Washington,
     DC: Government Printing Office, Oct 1990

[2]  CNA Research Memorandum 91-242, Development and Scoring of
     Hands-On Performance Tests for Mechanical Maintenance
     Specialties, by Neil B. Carey and Paul W. Mayberry, Mar 1992

[3]  Alexandra K. Wigdor & Bert F. Green, Jr., Eds. Assessing the
     Performance of Enlisted Personnel: Evaluation of a Joint Service
     Research Project. Washington, DC: National Academy Press, 1986

[4]  D. A. Belsley, E. Kuh, & R. E. Welsch. Regression Diagnostics.
     New York: John Wiley & Sons, 1980


